source('misc_functions/functions.R')
library(tidyverse)
── Attaching packages ─────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
library(modelr)
library(mgcv)
Loading required package: nlme
Attaching package: ‘nlme’
The following object is masked from ‘package:dplyr’:
collapse
This is mgcv 1.8-25. For overview type 'help("mgcv-package")'.
The data set has been constructed using average Science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars), Educational Index, Health Index, and Human Development Index from UN data. The key variables are Country, Overall (the overall science score, along with the subscores Evidence, Explain, and Issues), Income, Edu, and Health.
The first thing to do is get the data in and do some initial inspections.
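A minimal sketch of that step might look like the following; the file name and path are assumptions, so adjust them to wherever your copy of the PISA 2006 data lives.

```r
# read in the PISA 2006 data (path is an assumption)
pisa = read.csv('data/pisasci2006.csv')

glimpse(pisa)   # variable types and a few values
summary(pisa)   # basic descriptive statistics
```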
We can look at the individual relationships of covariates with overall science score.
dmelt = pisa %>%
select(-Evidence, -Explain, -Issues) %>%
gather(key=Variable,
value=Value,
-Overall, -Country)
ggplot(aes(x=Value,y=Overall), data=dmelt) +
geom_point(color='#ff5500',alpha=.75) +
geom_smooth(se=F, lwd=.5, color='#00aaff') +
geom_text(aes(label=Country), alpha=0, size=1,angle=30, hjust=-.2, # making transparent so only plotly will show the country
vjust=-.2) +
facet_wrap(~Variable, scales='free_x') +
labs(x='') +
theme_trueMinimal()
We will start with the simple situation of a single predictor. Let’s begin by using a typical linear regression to predict science scores by the Income index. We could use the standard R lm function, but I’ll leave that as an exercise for comparison. We can still do straightforward linear models with the gam function, and again it is important to note that the standard linear model can be seen as a special case of a GAM.
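The fit itself looks just like it would with lm; the name mod_lm matches its later use in the AIC comparison.

```r
# a standard linear model, fit via gam
mod_lm <- gam(Overall ~ Income, data=pisa)
summary(mod_lm)

# for comparison, lm gives identical coefficients:
# coef(lm(Overall ~ Income, data=pisa))
```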
Family: gaussian
Link function: identity
Formula:
Overall ~ Income
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 204.32 35.37 5.777 4.32e-07 ***
Income 355.85 46.79 7.606 5.36e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.518 Deviance explained = 52.7%
GCV = 1504.5 Scale est. = 1448.8 n = 54
Let’s now fit an actual generalized additive model using the same cubic spline as our basis. We again use the gam function as before for basic model fitting, but now we are using a function s within the formula to denote the smooth terms. Within that function we also specify the type of smooth, though a default is available. I chose bs = "cr", denoting cubic regression splines.
mod_gam1 <- gam(Overall ~ s(Income, bs="cr"), data=pisa)
summary(mod_gam1)
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income, bs = "cr")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 470.444 4.082 115.3 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 6.895 7.741 16.67 1.59e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.7 Deviance explained = 73.9%
GCV = 1053.7 Scale est. = 899.67 n = 54
In the summary, we first see the distribution assumed and the link function used, in this case normal and identity, respectively. To reiterate, with no smooth terms this would simply be a standard linear model.
We have two general pieces of output: the parametric coefficients and the smooth terms.
In this case, the only parametric component is the intercept, but it’s good to remember that you are not bound to smooth every effect of interest. Indeed, as we will discuss in more detail later, part of the process may involve refitting the model with terms that were found to be mostly linear anyway. The smooth component for a country’s income suggests a statistically significant relationship with overall science score. A couple of things in this part of the summary may be unfamiliar, such as the effective degrees of freedom; you may consult the primary document for details, but the main thing is that you can interpret the result as you would any entry in a regression table. The statistical significance is an approximation though, so again, let visualization be your guide.
Since this is in the end a linear model with a normal distribution for the target, we can use the R^2 value as we would with lm; note that it is already adjusted for you. The deviance explained is the unadjusted R^2 in this case, but it generalizes to models that don’t assume the normal distribution. The scale estimate is the estimated residual variance, i.e. the mean squared error from the standard regression setting, and the GCV score is a better estimate of what that would be on held-out (i.e. new) data. The point is that you are seeing the usual output you have with other models, just with different labels.
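All of these quantities can also be pulled directly from the fitted object rather than read off the printed summary; the component names below are from mgcv's gam and summary.gam objects.

```r
s <- summary(mod_gam1)

s$r.sq             # adjusted R-squared
s$dev.expl         # deviance explained (the unadjusted R-squared here)
mod_gam1$gcv.ubre  # the GCV score
mod_gam1$sig2      # the scale estimate (residual variance)
```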
The primary way to interpret the effect of income is visually though. The mgcv package will provide a basic plot as follows:
plot(mod_gam1)
What you’re seeing is a component plot with the y/fitted values centered. Almost no one finds this intuitive in practice, nor is it usually what they’d report. Instead we can get predictions at key values of other covariates (e.g. held at their mean or reference category). In this case with only one predictor, we just get the predicted values on the original scale of the outcome. We can use the ggeffects package for this (or see my visibly package).
library(ggeffects)
plot_dat <- ggpredict(mod_gam1, terms = "Income")
ggplot(plot_dat, aes(x = x, y = predicted)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .25) +
geom_line(color = 'dodgerblue') +
labs(x = 'Income')
We can compare models via AIC or the GCV coefficient, both of which focus on the predictive capabilities of the model on new data.
AIC(mod_lm, mod_gam1)
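The GCV scores can likewise be compared side by side; gcv.ubre holds the GCV criterion for models estimated with the default smoothing parameter selection.

```r
# lower is better for both models
c(lm = mod_lm$gcv.ubre, gam = mod_gam1$gcv.ubre)
```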
We can also conduct an approximate likelihood ratio test.
anova(mod_lm, mod_gam1, test="Chisq")
Analysis of Deviance Table
Model 1: Overall ~ Income
Model 2: Overall ~ s(Income, bs = "cr")
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 52.000 75336
2 45.259 41479 6.7411 33857 2.778e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Let’s now move to the setting with multiple predictors, again starting with the standard linear model.
mod_lm2 <- gam(Overall ~ Income + Edu + Health, data=pisa)
summary(mod_lm2)
Family: gaussian
Link function: identity
Formula:
Overall ~ Income + Edu + Health
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121.18 78.97 1.535 0.1314
Income 182.32 85.27 2.138 0.0376 *
Edu 234.11 54.78 4.274 9.06e-05 ***
Health 27.01 134.90 0.200 0.8421
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.616 Deviance explained = 63.9%
GCV = 1212.3 Scale est. = 1119 n = 52
And now the generalized additive model, with a smooth term for each predictor.
mod_gam2 <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data=pisa)
summary(mod_gam2)
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income) + s(Edu) + s(Health)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 471.154 2.772 170 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 7.593 8.415 8.826 1.33e-07 ***
s(Edu) 6.204 7.178 3.308 0.00733 **
s(Health) 1.000 1.000 2.736 0.10661
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.863 Deviance explained = 90.3%
GCV = 573.83 Scale est. = 399.5 n = 52
We can see the fit appears to be better. In addition, since the effective degrees of freedom for Health is 1, its effect is essentially linear, so if desired we could leave it among the parametric terms.
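As a sketch of that refit, Health simply moves out of a smooth term; the model name below is made up for illustration, and the fit would differ only trivially from mod_gam2.

```r
# Health enters as an ordinary linear (parametric) term
mod_gam2b <- gam(Overall ~ s(Income) + s(Edu) + Health, data=pisa)
summary(mod_gam2b)
```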
# plot(mod_gam2)
library(patchwork) # devtools::install_github("thomasp85/patchwork")
g1 =
ggpredict(mod_gam2, terms = "Income") %>%
ggplot(aes(x = x, y = predicted)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .25) +
geom_line(color = 'dodgerblue') +
labs(x = 'Income')
g2 =
ggpredict(mod_gam2, terms = "Edu") %>%
ggplot(aes(x = x, y = predicted)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .25) +
geom_line(color = 'dodgerblue') +
labs(x = 'Edu')
g3 =
ggpredict(mod_gam2, terms = "Health") %>%
ggplot(aes(x = x, y = predicted)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .25) +
geom_line(color = 'dodgerblue') +
labs(x = 'Health')
(g2 + g3 + g1 + plot_layout(nrow = 2)) * theme_trueMinimal()
We can also examine an interaction of covariates on the smooth scale with a tensor product smooth, te.
mod_gam3 <- gam(Overall ~ Health + te(Income, Edu), data=pisa)
summary(mod_gam3)
Family: gaussian
Link function: identity
Formula:
Overall ~ Health + te(Income, Edu)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 574.5 101.4 5.666 1.42e-06 ***
Health -115.8 113.5 -1.020 0.314
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
te(Income,Edu) 10.31 12.41 9.084 2.43e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.804 Deviance explained = 84.8%
GCV = 747.52 Scale est. = 570.59 n = 52
# requires the visibly package, otherwise see the main document;
# devtools::install_github("m-clark/visibly")
visibly::plot_gam_3d(model = mod_gam3, main_var = Income, second_var = Edu, palette='bilbao', direction = 1)
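If the visibly package is not available, mgcv's built-in vis.gam gives a comparable, if less polished, view of the tensor product surface.

```r
# contour plot of the smooth surface over Income and Edu
vis.gam(mod_gam3, view = c('Income', 'Edu'), plot.type = 'contour')

# or a perspective plot
# vis.gam(mod_gam3, view = c('Income', 'Edu'), theta = 30)
```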
As before, we can compare the models with AIC.
AIC(mod_lm2, mod_gam2)